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Abstract 

In a recent paper, Parikh and Boyd describe a method for solving a convex opti¬ 
mization problem, where each iteration involves evaluating a proximal operator and 
projection onto a subspace. In this paper we address the critical practical issues of how 
to select the proximal parameter in each iteration, and how to scale the original prob¬ 
lem variables, so as the achieve reliable practical performance. The resulting method 
has been implemented as an open-source software package called POGS (Proximal 
Graph Solver), that targets multi-core and GPU-based systems, and has been tested 
on a wide variety of practical problems. Numerical results show that POGS can solve 
very large problems (with, say, more than a billion coefficients in the data), to modest 
accuracy in a few tens of seconds. As just one example, a radiation treatment planning 
problem with around 100 million coefficients in the data can be solved in a few seconds, 
as compared to around one hour with an interior-point method. 


1 Introduction 

We consider the convex optimization problem 

minimize f{y) + g{x) 
snbject to y = Ax, 

where x G R" and y G R™' are the variables, and the (extended-real-valned) fnnctions 
/ : R™ —)■ R U {cxo} and g : R"' —)■ R U {cxo} are convex, closed and proper. The matrix 
A G and the fnnctions / and g are the problem data. Inhnite valnes of / and g allow 

ns to encode convex constraints on x and y, since any feasible point (x, y) mnst satisfy 

X G {x I g{x) < CX)}, y e {y \ fiy) < oo}. 

We will be interested in the case when / and g have simple proximal operators, bnt for now 
we do not make this assnmption. The problem form ([T]) is known as graph form |PB13aj . 
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since the variable (x,?/) is constrained to lie in the graph ^ = {(x,?/) G | y = Ax} of 

A. We denote p* as the optimal value of ([T]), which we assume is hnite. 

The graph form includes a large range of convex problems, including linear and quadratic 
programming, general conic programming |BV04t §11.6], and many more specihc applications 
such as logistic regression with various regularizers, support vector machine htting [HTF09] , 
portfolio optimization jBV04| §4.4.1] [GM75] |BMOWl^ . and radiation treatment planning 
|QW06] . to name just a few. 

In |PB13a] . Parikh and Boyd described an operator splitting method for solving the 
graph form problem ([1]), based on the alternating direction method of multipliers (ADMM) 
[BPC+llj . Each iteration of this method requires a projection (either exactly or approxi¬ 
mately via an iterative method) onto the graph Q, and evaluation of the proximal operators of 
/ and g. Theoretical convergence was established in that paper, and basic implementations 
demonstrated. However it has been observed that practical convergence of the algorithm 
depends very much on the choice of algorithm parameters (such as the proximal parameter 
p), and scaling of the variables {i.e., pre-conditioning). 

The purpose of this paper is to explore these issues, and to add some critical variations 
on the algorithm that make it a relatively robust general purpose solver, at least for modest 
accuracy levels. The algorithm we propose, which is the same as the basic method described 
in |PB13aj . with modihed parameter selection, diagonal pre-conditioning, and modihed stop¬ 
ping criterion, has been implemented in an open-source software project called BOGS (for 
Proximal Graph Solver), and tested on a wide variety of problems. Our GUDA imple¬ 
mentation reliably solves (to modest accuracy) problems 1000 x larger than those that can 
be handled by interior-point methods; and for those that can be handled by interior-point 
methods, 100 x faster. As a single example, a radiation treatment planning problem with 
more than 100 million coefficients in A can be solved in a few seconds; the same problem 
takes around one hour to solve using an interior-point method. 


1.1 Outline 

In 1 11.21 we describe related work. In 1|2] we derive the graph form dual problem, and the 
primal-dual optimality conditions, which we use to motivate the stopping criterion and to 
interpret the iterates of the algorithm. In 1|3] we describe the ADMM-based graph form 
algorithm, and analyze the properties of its iterates, giving some results that did not appear 
in |PB13aj . In l|l]we address the topic of pre-conditioning, and suggest novel pre-conditioning 
and parameter selection techniques. In l|5]we describe our implementation POGS, and in 1|6] 
we report performance results on various problem families. 


1.2 Related work 

Many generic methods can be used to solve the graph form problem ([1]), including projected 
gradient descent [GM87] . projected subgradient methods |Pol87i Ghap. 5] |Sho98j . operator 
splitting methods [LM79] |ES08j . interior-point methods [NW991 Ghap. 19] |BTN01l Ghap. 
6] and many more. (Of course many of these methods can only be used when additional 
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assumptions are made on / and g, e.g., differentiability or strong convexity.) For example, 
if / and g are separable and smooth (or have smooth barrier functions for their epigraphs), 
the problem ([T]) can be solved by an interior-point method, which in practice always takes 
no more than a few tens of iterations, with each iteration involving the solution of a system 
of linear equations that requires 0(max{m, u} min{m, n}^) flops when A is dense [BV04( 
Chap. 11] |NW99| Chap. 19]. 

We now turn to hrst-order methods for the graph form problem ([T]). In |OV14j O’Connor 
and Vandenberghe propose a primal-dual method for the graph form problem where A is the 
sum of two structured matrices. They contrast it with methods such as Spingarn’s method of 
partial inverses |Spi85| , Douglas-Rachford splitting |DR56] , and the Chambolle-Pock method 
[CPII] . 

Davis and Yin |DY 14] analyze convergence rates for different operator splitting methods, 
and in |Gisl5j Giselsson proves the tightness of linear convergence for the operator splitting 
problems considered [GB14b] . Goldstein et ah |GOSB14] derive Nesterov-type acceleration, 
and show 0{l/k‘^) convergence for problems where / and g are both strongly convex. 

Nishihara et ah |NLR'*~15] introduce a parameter selection framework for ADMM with 
over relaxation |EB92j . The framework is based on solving a hxed-size semidehnite program 
(SDP). They also make the assumption that / is strongly convex. Ghadimi et ah |GTSJ13] 
derive optimal parameter choices for the case when / and g are both quadratic. In [GB14b] 
Giselsson and Boyd show how to choose metrics to optimize the convergence bound, and 
in |GB14aj Giselsson and Boyd suggest a diagonal pre-conditioning scheme for graph form 
problems based on semidehnite programming. This scheme is primarily relevant in small to 
medium scale problems, or situations where many different graph form problems, with the 
same matrix A, are to be solved. 

It is clear from these papers (and indeed, a general rule) that the practical convergence 
of hrst-order mthods depends heavily on algorithm parameter choices. All of these papers 
make additional assumptions about the objective, which we do not. 

GPUs are used extensively for training neural networks jNCL'*'ll|. lCMM'*'lT| IKSH12( 
IGHW+13j . and they are slowly gaining popularity in convex optimization as well |PG1H 
IG()PB131IWBT4] . 


2 Optimality conditions and duality 

2.1 Dual graph form problem 

The Lagrange dual function of ([I]) is given by 

inf f{y) + g{x) + v'^{Ax -y) = - g*{-A^v) 

^,y 

where u G R” is the dual variable associated with the equality constraint, and f* and g* are 
the conjugate functions of / and g respectively [BV041 Chap. 4]. Introducing the variable 
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II = -A^v, we can write the dual problem as 

maximize -/*(z^) - ^ 2 ) 

subject to yU = —A^iy. 

The dual problem can be written as a graph form problem if we negate the objective and 
minimize rather than maximize. The dual graph form problem ([2]) is related to the primal 
graph form problem ([1]) by switching the roles of the variables, replacing the objective 
function terms with their conjugates, and replacing A with —A'^. 

The primal and dual objectives are p(x, y) = f{y) + g{x) and (i(/r, z/) = —/*(z/) — g*{y) 
respectively, giving us the duality gap 

g = p{x, y) - d{p, v) = f{y) + f*{u) + g{x) + g*{p). (3) 

We have g > 0, for any primal and dual feasible tuple {x,y,p, u). The duality gap g gives 
a bound on the suboptimality of (x, y) (for the primal problem) and also (/r, u) for the dual 
problem; 

fiv) + gix) <p* + g, - g*{.g) >p"-g- 

2.2 Optimality conditions 

The optimality conditions for ([T]) are readily derived from the dual problem. The tuple 
(x, y, yU, u) satishes the following three conditions if and only it is optimal. 

Primal feasibility: 


Dual feasibility: 


y = Ax. 


p = —A^v. 


(4) 

(5) 


Zero gap: 


fiv) + /■(") + s{x) + g’M = 0. 


( 6 ) 


If both (jlD and ([5]) hold, then the zero gap condition ([6]) can be replaced by the Fenchel 
equalities 


f{y) + FA) = g{x) + g*iy) = (7) 

We refer to a tuple (x, ?/, yU, z/) that satisfies ([7]) as Fenehel feasible. To verify the statement, 
we add the two equations in ([7]), which yields 

f{y) + f*F) + gA) + g*{y) = y'^^ + = {Ax)'^u - x'^a^u = o. 
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The Fenchel equalities ([7]) are is also equivalent to 


uedf{y), yedg{x), (8) 

where d denotes the subdifferential, which follows because 

z/ e df{y) ^ sup - f{z)) = i/y - f{y) ^ f{y) + f*{u) = v'^y. 

Z 

In the sequel we will assume that strong duality holds, meaning that there exists a tuple 
(x*, 2 /*, /i*, V*) which satishes all three optimality conditions. 

3 Algorithm 

3.1 Graph projection splitting 

In |PB13a] Parikh et ah apply ADMM [BPC"*"!!! §5] to the problem of minimizing f{y) + 
g{x), subject to the constraint (x,?/) G Q. This yields the graph projection splitting algo¬ 
rithm [H 

Algorithm 1 Graph projection splitting 

Input: A,f,g 

1: Initialize {x^, y^, x^, y^) = 0, k = 0 

2: repeat 

3: 7/*^+^/^):= (proXg(x^ — x^), proxj(|/^ — 

4: := -1-x^, y^+^/2-f 

5: := (x^ -1- - x^+\ y’^ _ y^+^) 

6: k ■.= k + 1 

7: until converged 


The variable k is the iteration counter, 3 ;^+!^ 3 ;^+i /2 ^ yk+i^yk+ 1 / 2 ^ ^ pj,-_ 

mal variables, G R"" and y^~'~^ G R™ are scaled dual variables, 11 denotes the (Euclidean) 
projection onto the graph 

prox^(n) = argmin (^f(y) + (p/ 2 ) \\y - v\\l'^ 


is the proximal operator of / (and similarly for g), and p > 0 is the proximal parameter. 
The projection 11 can be explicitly expressed as the linear operator 


n(c, d) 


K-^ 


c + A^d 
0 


K 


7 

A -I 


(9) 


Roughly speaking, in steps 3 and 5, the x (and x) and y (and y) variables do not mix; 
the computations can be carried out in parallel. The projection step 4 mixes the x, x and 
p, y variables. 
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General convergence theory for ADMM [BPC"*"!! 
sumption on the existence of a solution) 

^fc+l) _ ^^k+l/2^ yk+1/2^ ^ ^ g^^k^ 


§3.2] guarantees that (with our as- 
^ {x\y*), ( 10 ) 


as fc —)■ cxD. 


3.2 Extensions 

We discuss three common extensions that can be used to speed up convergence in practice: 
over-relaxation, approximate projection, and varying penalty. 

Over-relaxation. Replacing by + (1 — a)x^ in the projection and dual 

update steps is known as over-relaxation if a > 1 or under-relaxation if a < 1. The algorithm 
is guaranteed to converge jEB92j for any a G (0,2); it is observed in practice jQSB13j 
|AHW12] that using an over-relaxation parameter in the range [1.5, 1.8] can improve practical 
convergence. 

Approximate projection. Instead of computing the projection 11 exactly one can use an 
approximation IT, with the only restriction that 

- fI(a:'^+^/^ 2 /'^+^/ 2)||2 < oo. 

This is known as approximate projection |OSB13] . This extension is particularly useful if 
the approximate projection is computed using an indirect or iterative method. 


Varying penalty. Large values of p tend to encourage primal feasibility, while small values 
tend to encourage dual feasibility IBPC"*"!!! §3.4.1]. A common approach is to adjust or vary 


p in each iteration, so that the primal and dual residuals are (roughly) balanced in magnitude. 
When doing so, it is important to re-scale by a factor p^/p^^^. 


3.3 Feasible iterates 

In each iteration, algorithm [T] produces sets of points that are either primal, dual, or Fenchel 
feasible. Dehne 

/ = -ph^ ^ _p(a;fc+l/2 _ ^k+l/2 ^ _p(^fc+l/2 _ykj^ yky 

The following statements hold. 

1. The pair is primal feasible, since it is the projection onto the graph Q. 
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2. The pair is dual feasible, as long as (/i°, is dual feasible and {x^,y^) is 

primal feasible. Dual feasibility implies = 0, which we show using the 

update equations in algorithm d) 

l^k+l J(r^k+l ^ + xk+l/2 _ ^k+l ^T(^~k yk+l/2 _ 

= -p{x^ + - {I + A^A)x^"^^), 

where we substituted = Ax^^^. From the projection operator in ([H]) it follows 
that (/ + AA"A)x^^^ = therefore 

yk+i X^yk+i ^ _y(^~k = p!" + A^p^ = Ai° + 

where the last equality follows from an inductive argument. Since we made the as¬ 
sumption that (/r°, z/°) is dual feasible, we can conclude that is also dual 

feasible. 

3. The tuple ^ ^k+i/ 2 ^^k+i/ 2 \^ jg Fenchel feasible. From the dehnition of the 

proximal operator, 

^k+i /2 ^ (g{x) + {p/'2) ||x — x^ + \ 0 G dg{x^~^^^‘^) + — x^ + x^) 

^ ^^+1/2 g dg{x^"^^l’^). 

By the same argument G 

Applying the results in (fTOj) to the dual variables, we hud _). ^y* and —)■ p*, 

from which we conclude that (a;^''“^Z2^ p^’''^Z2^ ^fc-i- 1 / 2 ^ jg primal and dual feasible in the 

limit. 

3.4 Stopping criteria 

In ^3.31 we noted that either (jH 0 ED or (jH O ED are sufficient for optimality. We present 
two different stopping criteria based on these conditions. 

Residual based stopping. The tuple (^a;fc+iZ 2 ^ ^fc+ 1 / 2 ^ ^fc+ 1 / 2 ^ jg pg^chel feasible in 

each iteration, but only primal and dual feasible in the limit. Accordingly, we propose the 
residual based stopping criterion 

||y4a;fc+lZ2 _ /+V2||2 < gpri^ ||yl'^zy^+V2 + /+V2||2 < gdual^ 

where the and are positive tolerances. These should be chosen as a mixture of 
absolute and relative tolerances, such as 

gpri _ gabs ^rel||^fc+l/2||^^ ^dual _ ^abs ^ ^rel||^fc+1/2 ||^_ 

Reasonable values for and e'’®* are in the range [10“^, 10“^]. 
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Gap based stopping. The tuple z/^) is primal and dual feasible, but only 

Fenchel feasible in the limit. We propose the gap based stopping criteria 

h" = /(2/1+ 

where should be chosen relative to the current objective value, he., 

£**'” = e*‘” + €'"l/(!/'‘)+s(i‘)|. 

Here too, reasonable values for and e'’®* are in the range [10“^^, 10“^]. 

Although the gap based stopping criteria is very informative, since it directly bounds the 
suboptimality of the current iterate, it suffers from the drwaback that /, g, f* and g* must 
all have full domain, since otherwise the gap can be inhnite. Indeed, the gap is almost 
always inhnite when f or g represent constraints. 


3.5 Implementation 


Projection. There are different ways to evaluate the projection operator H, depending on 
the structure and size of A. 

One simple method that can be used if A is sparse and not too large is a direct sparse 
factorization. The matrix K is quasi-dehnite, and therefore the LDLA decomposition is 
well dehned |Van95j . Since K does not change from iteration to iteration, the factors L 
and D (and the permutation or elimination ordering) can be computed in the hrst iteration 
(e.^f., using CHOLMOD [CDHR08] ) and re-used in subsequent iterations. This is known as 
factorization caching [BPC'*~11|. §4.2.3] jPB13a( §A.l]. With factorization caching, we get a 
(potentially) large speedup in iterations, after the hrst one. 

If A is dense, and min(m, n) is not too large, then block elimination |BV04( Appendix 
C] can be applied to K |PB13al Appendix A], yielding the reduced update 


;= (A'^A I)~^{c A'^d) 

yk+i _ 


if m > n, or 


■=d+ {AA^ + I)-\Ac - d) 

0 :*^+^ := c - A^{d - /+^) 

if m < n. Both formulations involve forming and solving a system of equations in 
Since the matrix is symmetric positive dehnite, we can use the Cholesky decomposition. 
Forming the coefficient matrix A^A -|- / or AA^ -|-1 dominates the computatation. Here too 
we can take advantage of factorization caching. 

The regular structure of dense matrices allows us to analyze the computational complexity 
of each step. We dehne q = min(m, n) and p = max(m,n). The hrst iteration involves the 
factorization and the solve step; subsequent iterations only require the solve step. The 
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computational cost of the factorization is the combined cost of computing A (or ^4^4^, 
whichever is smaller), at a cost of pq^ flops, in addition to the Cholesky decomposition, at a 
cost of (l/3)g^ flops. The solve step consists of two matrix-vector multiplications at a cost 
of 4pg flops and solving a triangular system of equations at a cost of flops. The total cost 
of the hrst iteration is 0{pq^) flops, while each subsequent iteration only costs 0{pq) flops, 
showing that we obtain a savings by a factor of q flops, after the hrst iteration, by using 
factorization caching. 

For very large problems direct methods are no longer practical, at which point indirect 
(iterative) methods can be used. Fortunately, as the primal and dual variables converge, 
we are guaranteed that meaning that we will have a good 

initial guess we can use to initialize the iterative method to (approximately) evaluate the 
projection. One can either apply CGLS (conjugate gradient least-squares) |HS52] or LSQR 
|PS82j to the reduced update or apply MINRES (minimum residual) |PS75j to K directly. It 
can be shown the latter requires twice the number of iterations as compared to the former, 
and is therefore not recommended. 

Proximal operators. Since the x, x and y, y components are decoupled in the proximal 
step and dual variable update step, both of these can be done separately, and in parallel for 
X and y. If either f or g is separable, then the proximal step can be parallelized further. The 
monograph |PB13b] details how proximal operators can be computed efficiently for a wide 
range of functions. Typically the cost of computing the proximal operator will be negligible 
compared to the cost of the projection. In particular, if / and g are separable, then the cost 
will be 0{m + n), and completely parallelizable. 

4 Pre-conditioning and parameter selection 

The practical convergence of the algorithm (he., the number of iterations required before it 
terminates) can depend greatly on the choice of the proximal parameter p, and the scaling 
of the variables. In this section we analyze these, and suggest a method for choosing p and 
for scaling the variables that (empirically) speeds up practical convergence. 

4.1 Pre-conditioning 

Consider scaling the variables x and y in ([1]), by E~^ and D respectively, where D G 
and E G are non-singular matrices. We dehne the scaled variables 

y = Dy, X = E~^x, 


which transforms ([T]) into 


minimize f{D ^y) + g{Ex) 
subject to y = DAEx. 


( 12 ) 
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This is also a graph form problem, and for notational convenience, we define 

A = DAE, f{y) = f{D-^y), g{x) = g{Ex), 

so that the problem can be written as 

minimize f{y) + g{x) 
subject to y = Ax. 

We refer to this problem as the pre-conditioned version of ([T]). Our goal is to choose D and E 
so that (a) the algorithm applied to the pre-conditioned problem converges in fewer steps in 
practice, and (b) the additional computational cost due to the pre-conditioning is minimal. 

Graph projection splitting applied to the pre-conditioned problem (IT^ can be interpreted 
in terms of the original iterates. The proximal step iterates are redefined as 

^fc+i /2 ^ argmin {g{x) + {p/2)\\x - x^ 3 A\\Ieet)-^ 

/+V2 = argmin {j{y) + {p/2)\\y - / + > 

and the projected iterates are the result of the weighted projection 

minimize (l/2)||x - + ( 1 / 2 )|| 2 / - d) 

subject to y = Ax, 


where ||x||p = \/ x"^Px for a symmetric positive-definite matrix P. This projection can be 
expressed as 


n(c,d) = k 


1 

\EEk~^c + A^D'^Dd 

jk _ 

\EEk~^ 

A^D'^D 


0 

5 - 

D^DA 

-D^D 


Notice that graph projection splitting is invariant to orthogonal transformations of the 
variables x and y, since the pre-conditioners only appear in terms of D^D and EE'^. In 
particular, if we let D = U'^ and E = V, where A = U'EV'^, then the pre-conditioned con¬ 
straint matrix A = DAE = E is diagonal. We conclude that any graph form problem can 
be pre-conditioned to one with a diagonal non-negative constraint matrix E. For analysis 
purposes, we are therefore free to assume that A is diagonal. We also note that for orthog¬ 
onal pre-conditioners, there exists an analytical relationship between the original proximal 
operator and the pre-conditioned proximal operator. With (/)(x) = (p(Qx), where Q is any 
orthogonal matrix {Q^Q = QQ^ = ki have 


prox^(n) = Q^prox^(Qn). 


While the proximal operator of 0 is readily computed, orthogonal pre-conditioners destroy 
separability of the objective. As a result, we can not easily combine them with other pre¬ 
conditioners. 
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Multiplying D by a scalar a and dividing E by the same scalar has the effect of scaling 
p by a factor of o?. It however has no effect on the projection step, showing that p can be 
thought of as the relative scaling of D and E. 

In the case where / and g are separable and both D and E are diagonal, the proximal 
step takes the simplihed form 

= argmin {gj{xj) + {pf I2){xj - x) ^ x)f) j = 1,..., n 

Xj 

^ argmin (/*(?/*) + (pf/2)(pi - yf + yff) i = 

Vi 

where pf = p/E‘jj and pf = pDf^. Since only p is modihed, any routine capable of computing 
prox^ and prox^ can also be used to compute the pre-conditioned proximal update. 


4.1.1 Effect of pre-conditioning on projection 

For the purpose of analysis, we will assume that A = H, where S is a non-negative diagonal 
matrix. The projection operator simplihes to 


n(c, d) 


■ (J + S^S)-1 


c 



d 


which means the projection step can be written explicitly as 


= 


1 


1 + a/ 


hr. 


k+l/2 


+ Xi + (Ti{yi ^ ^ + Pi)) 


+ x’l 


= 


(Ji 


1 -h a • 




% 


fc+i 


= 0 


1 < i < min(m, n) 

min(m, n) < i < n 
1 < i < min(m, n) 

min(m, n) < i < m, 


where ai is the ith diagonal entry of S and subscripted indices of x and y denote the ith 
entry of the respective vector. Notice that the projected variables x\'^^ and are equally 
dependent on + x\) and ai{y^'^^^‘^ -h pf). If Uj is either signihcantly smaller or larger 

than 1, then the terms and pf^^ will be dominated by either or (pf''"^^^+pf). 

However if cXi = 1, then the projection step exactly averages the two quantities 




1 < i < min(m, n). 


As we pointed out in 1|3|, the projection step mixes the variables x and p. For this to 


approximately reduce to averaging, we need cxj ~ 1. 
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4.1.2 Choosing D and E 

The analysis suggests that the algorithm will be fast when the singular values of DAE are 
all near one, i.e., 


cond(^DAE) ^ 1, ||Zldi ?||2 ~ 1. 


(13) 


(This claim is also supported in |GB14c] . and is consistent with our computational experi¬ 
ence.) Pre-conditioners that exactly satisfy these conditions can be found using the singular 
value decomposition of A. They will however be of little use, since such pre-conditioners 
generally destroy our ability to evaluate the proximal operators of / and g efficiently. 

So we seek choices of D and E for which flT5]l holds (very) approximately, and for which 
the proximal operators of / and g can still be efficiently computed. We now specialize to 
the special case when / and g are separable. In this case, diagonal D and E are candidates 
for which the proximal operators are still easily computed. (The same ideas apply to block 
separable / and g, where we impose the further constraint that the diagonal entries within 
a block are the same.) So we now limit ourselves to the case of diagonal pre-conditioners. 

Diagonal matrices that minimize the condition number of DAE, and therefore approxi¬ 
mately satisfy the hrst condition in ([13]), can be found exactly, using semidehnite program¬ 
ming |BEGFB94l §3.1]. But this computation is quite involved, and may not be worth the 


computational effort since the conditions (IT^ are just a heuristic for faster convergence. 
(For control problems, where the problem is solved many times with the same matrix A, this 
approach makes sense; see |GB14aj .l 

A heuristic that tends to minimize the condition number is to equilibrate the matrix, 
choose D and E so that the rows all have the same p-norm, and the columns all have 


i.e. 


the same p-norm. (Such a matrix is said to be equilibrated.) This corresponds to Ending D 
and E so that 

\DAE\n = al, 1^\DAE\P = 


where a, /5 > 0. Here the notation | • should be understood in the elementwise sense. Various 
authors |()SB13j . |G()PB13] . [B ralOj suggest that equilibration can decrease the number of 
iterations needed for operator splitting and other first order methods. One issue that we 
need to address is that not every matrix can be equilibrated. Given that equilibration is 
only a heuristic for achieving ai{DAE) ^ 1, which is in turn a heuristic for fast convergence 
of the algorithm, partial equilibration should serve the same purpose just as well. 

Sinkhorn and Knopp |SK67] suggest a method for matrix equilibration for p < oo, and A 
is square and has full support. In the case p = oo, the Ruiz algorithm |Rui01] can be used. 
Both of these methods fail (as they must) when the matrix A cannot be equilibrated. We 
give below a simple modification of the Sinkhorn-Knopp algorithm, modified to handle the 
case when A is non-square, or cannot be equilibrated. 

Ghoosing pre-conditioners that satisfy ||DA £'||2 = 1 can be achieved by scaling D and 
E by a m»^ (DAE)~'^ and ama.x{DAEy~^ respectively for g G R. The quantity a m»x (DAE) 
can be approximated using power iteration, but we have found it is unnecessary to exactly 
enforce ||DAF ||2 = 1. A more computationally efficient alternative is to replace a r„„^ (DAE) 
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by ||DAi?||i?/A/min(m, n). This quantity coincides with amax.{DAE) when coTid{DAE) = 
1. If DAE is equilibrated and p = 2, this scaling corresponds to {DAE)'^{DAE) (or 
{DAE){DAE)^ when m < n) having unit diagonal. 


4.2 Regularized equilibration 


In this section we present a self-contained derivation of our matrix-equilibration method. It 
is similar to the Sinkhorn-Knopp algorithm, but also works when the matrix is non-square 
or cannot be exactly equilibrated. 

Consider the convex optimization problem with variables u and n, 


m n 

minimize 

i=l j=l 


nl^u 


ml^v + 7 


i=l j=l 


(14) 


where 7 > 0 is a regularization parameter. The objective is bounded below for any 7 > 0. 
The optimality conditions are 

n 

— n + (l/m) 7 e“' = 0 , i = 1,... ,m 
j=i 

m 

\Aijfe^^~^'^^ — m + {l/n)'je^^ = 0 , j = 1 ,..., n. 

^=1 


By dehning Da = and Ejj = these conditions are equivalent to 

\DAE\n + {l/m)-fDl = nl, 1'^\DAE\p + (l/njyl^E = ml'^. 


When 7 = 0 , these are the conditions for a matrix to be equilibrated. The objective may 
not be bounded when 7 = 0 , which exactly corresponds to the case when the matrix cannot 
be equilibrated. As 7 —)■ 00 , both D and E converge to the scaled identity matrix (mn/ 7 )/, 
showing that 7 can be thought of as a regularizer on the elements of D and E. If D and E 
are optimal, then the two equalities 

1^\DAE\^1 -|- (l/m) 7 l^Zll = mn, 1^\DAE\^1 + (l/njql'^i?! = mn 


must hold. Subtracting the one from the other, and dividing by 7 , we hnd the relationship 

(l/m)l^Zll = (l/n)l^i?l. 


implying that the average entry in D and E is the same. 

There are various ways to solve the optimization problem ([HD, one of which is to apply 
coordinate descent. Minimizing the objective in fflTD with respect to Ui yields 


^ \Aij\P + (7/m)e“? = n 77 

i=i 


n 
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and equivalently for Vj 


m 


Er=ie“">ol^ + (7/n)' 

Since the minimization with respect to uf is independent of the update can be done in 
parallel for each element of u, and similarly for v. Repeated minimization over u and v will 
eventually yield values that satisfy the optimality conditions. Algorithm [2] summarizes the 
equilibration routine. 


Algorithm 2 Regularized Sinkhorn-Knopp 

Input: A, e > 0, 7 > 0 
1: Initialize e° := 1, k := 0 
2: repeat 
3: k := k + 1 

4: := n diag(|A|Pe^“^ + (y/m)!)”^! 

5: := m diagdA^^l^d^ + (y/n)!)”^! 

6: until ||e^ — e ^~^\\2 < e and \\d^ — d ^~^\\2 < e 
7: return D := diag((i^)^/^, E := diag(e^)^/^ 


4.3 Adaptive penalty update 

The projection operator If does not depend on the choice of p, so we are free to update p in 
each iteration, at no extra cost. While the convergence theory only holds for hxed p, it still 
applies if one assumes that p becomes hxed after a hnite number of iterations |BPC'*'llj . 

As a rule, increasing p will decrease the primal residual, while decreasing p will decrease 
the dual residual. The authors in |HYWnnj . |BPC'*~ll] suggest adapting p to balance the 
primal and dual residuals. We have found that substantially better practical convergence 
can be obtained using a variation on this idea. Rather than balancing the primal and dual 
residuals, we allow either the primal or dual residual to approximately converge and only 
then start adjusting p. Based on this observation, we propose the following adaptive update 
scheme. 

Once either the primal or dual residual converges, the algorithm begins to steer p in a 
direction so that the other residual also converges. By making small adjustments to p, we 
will tend to remain approximately primal (or dual) feasible once primal (dual) feasibility has 
been attained. Additionally by requiring a certain number of iterations between an increase 
in p and a decrease (and vice versa), we enforce that changes to p do not flip-flop between 
one direction and the other. The parameter r determines the relative number of iterations 
between changes in direction. 
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Algorithm 3 Adaptive p update 

Input: <5>1, re(0,l], 

1: Initialize / := 0, u := 0 
2: repeat 

3: Apply graph projection splitting 

4: if IIand rk > I then 

5: := Sp’^ 

6: u := k 

7: else if ||Aa;^’''^/^ — and rk > u then 

8: p^+^ := (l/(5)p^ 

9: I := k 

10: until II and ||Aa;^+^/^ — 


5 Implementation 

Proximal Graph Solver (POGS) is an open-source (BSD-3 license) implementation of graph 
projection splitting, written in G++. It supports both GPU and GPU platforms and includes 
wrappers for G, MATLAB, and R. POGS handles all combinations of sparse/dense matrices, 
single/double precision arithmetic, and direct/indirect solvers, with the exception (for now) 
of sparse indirect solvers. The only dependency is a tuned BLAS library on the respective 
platform {e.g., cuBLAS or the Apple Accelerate Framework). The source code is available 
at 


https://github.com/cvxgrp/pogs 

In lieu of having the user specify the proximal operators of / and g, POGS contains a 
library of proximal operators for a variety of different functions. It is currently assnmed that 
the objective is separable, in the form 

m n 

f{y) + 9{x) = fiivi) + 

i=l j=l 

where fi,gj ; R — > R U {cxd}. The library contains a set of base fnnctions, and by applying 
various transformations, the range of functions can been greatly extended. In particular we 
use the parametric representation 

MVi) = CihiiaiVi - hi) + diVi + (l/2)ei|//, 

where ai,bi,di G R, Ci,ej G R+, and hjiR—)-RU{cx)}. The same representation is also 
used for gj. It is straightforward to express the proximal operators of /* in terms of the 
proximal operator of hi using the formula 

prox^(n) = i (^pTO^h,ie+p)/{ca^) (« “ d) /(c + p) - fe) + 6 
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where for notational simplicity we have dropped the index i in the constants and functions. 
It is possible for a user to add their own proximal operator function, if it is not in the current 
library. We note that the separability assumption on / and g is a simplification, rather than 
a limitation of the algorithm. It allows us to apply the proximal operator in parallel using 
either CUBA or OpenMP (depending on the platform). 

The constraint matrix is equilibrated using algorithm [2l with a choice of p = 2 and 
7 = (m + n)\/e^™P, where is machine epsilon. Both D and E are rescaled evenly, so that 
they satisfy ||BAi?||ir/-\/min(m, n) = 1. The projection fl is computed as outlined in 1 13.51 
We work with the reduced update equations in all versions of POGS. In the indirect case, 
we chose to use CGLS. The parameter p is updated according to algorithm [3l Empirically, 
we found that {6, r) = (1.05, 0.8) works well. We also use over-relaxation with a = 1.7. 

POGS supports warm starting, whereby an initial guess for and/or may be supplied 
by the user. If only is provided, then will be estimated, and vice-versa. The warm-start 
feature allows any cached matrices to be used to solve additional problems with the same 
matrix A. 

POGS returns the tuple since it has hnite primal and dual 

objectives. The primal and dual residuals will be non-zero and are determined by the spec¬ 
ified tolerances. 

Future plans for POGS include extension to block-separable / and g (including general 
cone solvers), additional wrappers for Julia and Python, support for a sparse direct solver, 
and a multi-GPU extension. 


6 Numerical results 

To highlight the robustness and general purpose nature of POGS, we tested it on 9 different 
problem classes using random data, as well as a radiation treatment planning problem using 
real-world data. 

All experiments were performed in single precision arithmetic on a machine equipped 
with an Intel Gore 17-870, 16GB of RAM, and a Tesla K40 GPU. Timing results include the 
data copy from GPU to GPU. 

We compare POGS to SDPT3 |TTT99] . an open-source solver that handles linear, second- 
order, and positive semidehnite cone programs. Since SDPT3 uses an interior-point algo¬ 
rithm, the solution returned will be of high precision, allowing us to verify the accuracy of 
the solution computed by POGS. Problems that took SDPT3 more than 150 seconds (of 
which there were many) were aborted. 

6.1 Random problem classes 

We considered the following 9 problem classes: Basis pursuit. Entropy maximization, Huber 
htting. Lasso, Logistic regression. Linear programming. Non-negative least-squares. Portfo¬ 
lio optimization, and Support vector machine htting. For each problem class, reasonable 
random instance were generated and solve; details about problem generation can be found 
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in Appendix For each problem class the number of non-zeros in A was varied on a loga¬ 
rithmic scale from 100 to 2 Billion. The aspect ratio of A also varied from 1:10 to 10:1, with 
the orientation (wide or tall) chosen depending on what was reasonable for each problem. 
We report running time averaged over all aspect ratios. 

The maximum number of iterations was set to 10^, but all problems converged in fewer 
iterations, with most problems taking a couple of hundred iterations. The relative tolerance 
was set to 10“^, and where solutions from SDPT3 were available, we verihed that the solu¬ 
tions produced by both solvers matched to 3 decimal places. We omit SDPT3 running times 
for problems involving exponential cones, since SDPT3 does not support them. 

Figure [1] compares the running time of POGS versus SDPT3, for problems where the 
constraint matrix A is dense. We can make several general observations. 

• POGS solves problems that are 3 orders of magnitude larger than SDPT3 in the same 
amount of time. 

• Problems that take 200 seconds in SDPT3 take 0.5 seconds in POGS. 

• POGS can solve problems with 2 Billion non-zeros in 10-50 seconds. 

• The variation in solve time across different problem classes was similar for POGS and 
SDPT3, around one order of magnitude. 

In summary, POGS is able to solve much larger problems, much faster (to moderate preci¬ 
sion) . 

6.2 Radiation treatment planning 

Radiation treatment is used to radiate tumor cells in cancer patients. The goal of radiation 
treatment planning is to hnd a set of radiation beam intensities that will deliver a specihed 
radiation dosage to tumor cells, while minimizing the impact on healthy cells. The problem 
can be stated directly in graph form, with x corresponding to the n beam intensities to 
be found, y corresponding to the radiation dose received at the m voxels, and the matrix 
A (whose elements are non-negative) giving the mapping from the beams to the received 
dosages at the voxels. This matrix comes from geometry, including radiation scattering 
inside the patient |AHIM06] . The objective g is the indicator function of the non-negative 
orthant (which imposes the constraint that Xj > 0), and / is a separable function of the 
form 

„ . \ _ / Vi * corresponds to a non-tumor voxel 

\ w~ max((ij — yi, 0) -|- wf max(|/j — dj, 0) i corresponds to a tumor voxel, 

where wf > 0 is the (given) weight associated with overdosing voxel i, where w~ > 0 is the 
(given) weight associated with underdosing voxel i, and d* > 0 is the target dose, given for 
each tumor voxel. We can also add the redundant constraint yi > 0 hy dehning fi{yi) = oo 
for yi < 0. 
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Figure 1: POGS (GPU version) vs. SDPT3 for dense matrices (color represents problem class). 


We present results for one instance of this problem, with m = 360000 voxels and n = 360 
beams. The matrix A comes from a real patient, and the objective parameters are chosen 
to achieve a good clinical plan. The problem is small enough that it can be solved (to high 
accuracy) by an interior-point method, in around one hour. POGS took a few seconds to 
solve the problem, producing a solution that was extremely close to the one produced by the 
interior-point method. In warm start mode, POGS could solve problem instances (obtained 
by varying the objective parameters) in under one second, allowing for real-time tuning of 
the treatment plan (by adjusting the objective function weights) by a radiation oncologist. 
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A Problem generation details 

In this section we describe how the problems in §6.11 were generated. 
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A.l Basis pursuit 

The basis pursuit problem |CDS98] seeks the smallest vector in the £i-norm sense that 
satisfies a set of underdetermined linear equality constraints. The objective has the effect of 
finding a sparse solution. It can be stated as 

minimize ||x||i 
subject to b = Ax, 


with equivalent graph form representation 


minimize I{y = b) + ||x||i 
subject to y = Ax. 


The elements of A were generated as Aij 
vector V G R"" as 


Vi ~ 


0 

AfiO,l/n) 


~ A/'(0,1)- To construct b we first generated a 

with probability p = 1/2 
otherwise, 


we then let b = Av. In each instance we chose m > n. 


A.2 Entropy maximization 

The entropy maximization problem [BV04] seeks a probability distribution with maximum 
entropy that satisfies a set of m affine inequalities, which can be interpreted as bounds on 
the expectations of arbitrary fnnctions. It can be stated as 

maximize — 

subject to l^x = 1, Ax < b, 


with equivalent graph form representation 


minimize / {yi-m < b) + I{ym+i 

'A' 

1 ^ 


subject to y = 


X. 


1 ) + 


The elements of A were generated as A^j ~ J\f{0,n). To construct b, we first generated a 
vector V G R"" as Vi ~ U[0, 1], then we set b = Fv/{l'^v). This ensures that there exists a 
feasible x. In each instance we chose m < n. 


A.3 Huber fitting 

Huber fitting or robust regression |Hnb64] performs linear regression under the assumption 
that there are outliers in the data. The problem can be stated as 

minimize ^™^hnber(6i — ajx), 
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where the Huber loss function is defined as 


huber(a;) 


{l/2)x^ |a;| < 1 

|a;| — (1/2) |a;| > 1 


The graph form representation of this problem is 

minimize ^/^^huber( 6 j — yi) 
subject to y = Ax. 

The elements of A were generated as Aij ~ To construct b, we hrst generated a 

vector V G R"" as Vi ~ Af{0, 1 /n) then we generated a noise vector e with elements 

f Af{0, 1/4) with probability p = 0.95 
\ f/[ 0 , 10 ] otherwise. 

Lastly we constructed b = Av + £. In each instance we chose m> n. 


A.4 Lasso 

The lasso problem |Tib96] seeks to perform linear regression under the assumption that the 
solution is sparse. An penalty is added to the objective to encourage sparsity. It can be 
stated as 


minimize ||Aa; — 6||2 + A||x||i, 
with graph form representation 

minimize |||/— 6||2 + A||x||i 
subject to y = Ax. 

The elements of A were generated as A^j ~ A/'(0,1)- To construct b we first generated a 
vector V G R”, with elements 

J 0 with probability p = 1/2 

\ A/’(0, l/’^) otherwise. 

We then let b = Av + e, where e represents the noise and was generated as £ 

i A(0,l/4). 

The value of A was set to (l/5)||A^6||oo. This is a reasonable choice since ||A^&||oo is the 
critical value of A above which the solution of the Lasso problem is x = 0. In each instance 
we chose m < n. 


20 




A.5 Logistic regression 


Logistic regression jHTFOQ] fits a probability distribution to a binary class label. Similar to 
the Lasso problem flA.4p a sparsifying l\ penalty is often added to the coefficient vector. It 
can be stated as 


minimize (log(l + exp(a;'^ai)) - hx^ai) + A||a;||i, 

where 6* G {0,1} is the class label of the ith sample, and aj is the ith row of A. The graph 
form representation of this problem is 

minimize (log(l + exp(?/i)) - hyi) + A||x||i, 

subject to y = Ax. 


The elements of A were generated as Aij ~ A/'(0,1). To construct b we first generated a 
vector V G with elements 


J 0 with probability p = 1/2 

[ J\f{0,l/n) otherwise. 


We then constructed the entries of b as 


J 0 with probability p 
\ 1 otherwise. 


1/(1 + exp(-afn)) 


The value of A was set to (1/10)||A'^((1/2)1 — 6)||oo- (||^^((1/2)1 — &)||oo is the critical of A 
above which the solution is a: = 0.) In each instance we chose m > n. 


A.6 Linear program 

Linear programs |BV04j seek to minimize a linear function subject to linear inequality con¬ 
straints. It can be stated as 

minimize 

subject to Ax < b, 

and has graph form representation 

minimize c^x + /(p < &) 
subject to y = Ax. 

The elements of A were generated as Aij ~ A^(0,1). To construct b we first generated a 
vector V G R"", with elements 

Vi ~ 2\A(0,1/n). 

We then generated b as b = Av + e, where e* ~ f/[0,1/10]. The vector c was constructed in 
a similar fashion. First we generate a vector u G R™, with elements 

Ui ~ U[0, 1], 

then we constructed c = —A'^u. This method guarantees that the problem is bounded. In 
each instance we chose m > n. 
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A.7 Non-negative least-squares 


Non-negative least-squares [CPOQ] seeks a minimizer of a least-squares problem subject to 
the solution vector being non-negative. This comes up in applications where the solution 
represents real quantities. The problem can be stated as 

minimize \\Ax — &II2 
subject to a; > 0, 

and has graph form representation 

minimize \\y — h\\\ -\- I{x > 0) 
subject to y = Ax. 

The elements of A were generated as Aij ~ A/'(0,1)- To construct b we first generated a 
vector V G R"", with elements 

Vi ~ l/n). 

We then generated b as b = Av + e, where Si ~ Af{0, 1/4). In each instance we chose m> n. 


A.8 Portfolio optimization 


Portfolio optimization or optimal asset allocation seeks to maximize the risk adjusted return 
of a portfolio. A common assumption is the /c-factor risk model |CK93j . which states that 
the return covariance matrix is the sum of a diagonal plus a rank k matrix. The problem 
can be stated as 

maximize y^x —''fx^{FF^ + D)x 
subject to X > 0, 1^X = 1 

where F G and D is diagonal. An equivalent graph form representation is given by 


minimize 
subject to 


- x'^y -h 'jx^Dx + I{x>0)+ 'jyi.,^yi-.m + I{Vm+i = 1) 


y = 


'F^' 


X. 


The elements of A were generated as Aij ~ A/'(0, 1). The diagonal of D was generated as 
Da ~ 17[0, y/k] and the the mean return y was generated as /r* ~ A/'(0, 1). The risk aversion 
factor 7 was set to 1. In each instance we chose n > k. 


A.9 Support vector machine 

The support vector machine [CV95] problem seeks a separating hyperplane classifier for a 
problem with two classes. The problem can be stated as 

minimize x'^x + max(0, biufx + 1), 
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where bi G {—1, +1} is a class label and aj is the ith row of A. It has graph form represen¬ 
tation 

minimize max(0, yi + 1) + x^x 

subject to y = diag(6)74a;. 

The vector b was chosen to so that the first m/2 elements belong to one class and the second 
m/2 belong to the other class. Specihcally 


h 


-1-1 i < m/2 
—1 otherwise. 


Similarly, the elements of A were generated as 


f Af{+l/n,l/n) i<m/2 
\ J\f{—l/n,l/n) otherwise. 


This choice of A causes the rows of A to form two distinct clusters. In each instance we 
chose m > n. 
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